This notebook contains all code for parsing and analysis the protein clusters generated using MMSeq2.
A representative sequence from each of the 4 largest clusters compiled using MMSeq2 (with a percentage identity and coverage of cut-off of 70%) was extracted from a local CAZyme database using cazy_webscraper.
Each representative sequence was identified by using the GenBank accession assigned as the name of the cluster by MMSeq2.
The 4 representative sequences were analysed using BLASTP. The Python script run_blastp.py from the Python package pyrewton DOI:10.5281/zenodo.3876218) was used to run the BLASTP all-versus-all analysis.
The table compiled by BLASTP contains the following columns: - qseqid - sseqid - pident - length - mismatch - gapopen - qstart - qend - sstart - send - evalue - bitscore
In order to compare each of the pair-wise-alignment we need to calculate the Blast Score Ratio (SCR) to normalise for length.
The bitscore reported by BLAST is the sum of the qualities of the aligned symbols over the whole alignment. This is an accurate measure of the alignment strength, but long sequences tend to have higher bitscores than short sequences, even when the matches are of about the same quality. To correct for this length effect, we can calculate a normalised bitscore where:
normalised bitscore = bitscore / query length
Table 2.1 presents the raw output from BLASTP as well as the BSR from the all-verus-all BLASTP analysis of the representative sequences from the 4 largest clusters compiled by MMSeq2.
| qseqid | sseqid | pident | cov | qlen | slen | alen | bitscore | evalue | bsr |
|---|---|---|---|---|---|---|---|---|---|
| CBK69950.1 | CBK69950.1 | 100.000 | 100 | 539 | 539 | 539 | 1121 | 0 | 2.0797774 |
| CBK69950.1 | AGE22437.1 | 49.171 | 100 | 539 | 567 | 543 | 520 | 0 | 0.9647495 |
| CBK69950.1 | CDG29680.1 | 49.631 | 99 | 539 | 533 | 542 | 520 | 0 | 0.9647495 |
| CBK69950.1 | QJR11213.1 | 44.383 | 99 | 539 | 534 | 543 | 412 | 0 | 0.7643785 |
| QJR11213.1 | QJR11213.1 | 100.000 | 100 | 534 | 534 | 534 | 1107 | 0 | 2.0730337 |
| QJR11213.1 | CDG29680.1 | 47.532 | 99 | 534 | 533 | 547 | 477 | 0 | 0.8932584 |
| QJR11213.1 | AGE22437.1 | 49.358 | 99 | 534 | 567 | 545 | 473 | 0 | 0.8857678 |
| QJR11213.1 | CBK69950.1 | 44.954 | 99 | 534 | 539 | 545 | 417 | 0 | 0.7808989 |
| CDG29680.1 | CDG29680.1 | 100.000 | 100 | 533 | 533 | 533 | 1118 | 0 | 2.0975610 |
| CDG29680.1 | AGE22437.1 | 66.417 | 99 | 533 | 567 | 533 | 753 | 0 | 1.4127580 |
| CDG29680.1 | CBK69950.1 | 49.631 | 99 | 533 | 539 | 542 | 520 | 0 | 0.9756098 |
| CDG29680.1 | QJR11213.1 | 47.091 | 99 | 533 | 534 | 550 | 471 | 0 | 0.8836773 |
| AGE22437.1 | AGE22437.1 | 100.000 | 100 | 567 | 567 | 567 | 1179 | 0 | 2.0793651 |
| AGE22437.1 | CDG29680.1 | 66.417 | 94 | 567 | 533 | 533 | 753 | 0 | 1.3280423 |
| AGE22437.1 | CBK69950.1 | 49.171 | 94 | 567 | 539 | 543 | 520 | 0 | 0.9171076 |
| AGE22437.1 | QJR11213.1 | 49.088 | 93 | 567 | 534 | 548 | 469 | 0 | 0.8271605 |
Figure 2.1 is an interactive plot presenting the BSR from the all-verus-all BLASTP analysis of the representative sequences from the 4 largest clusters compiled by MMSeq2.
To view the specific BSR for each comparison, hover over the plot and a tooltip will appear and will present the GenBank accessions of the corresponding proteins as well as the specific BSR value (to 3dp).
Figure 2.1: One-dimensional scatter plot of specificity scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.